Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed Issue: #1977 #2181

Merged
merged 5 commits into from
Dec 9, 2024
Merged

Fixed Issue: #1977 #2181

merged 5 commits into from
Dec 9, 2024

Conversation

SSivakumar12
Copy link
Contributor

@SSivakumar12 SSivakumar12 commented Oct 14, 2024

First of all, I want to take the opportunity to thank the awesome work being done by Martin and all contributors both past and present. This is a really good tool that we have used at work and has provided us with a lot of meaningful insights so I figured that I could do my bit to contribute to a tool that has helped me previously :)

This is my first PR and I sincerely hope it is not my last especially within this repo since I really want to give back to this project.

What does this PR do?

Address the issue raised in issues #1977 which I believe has not been resolved yet.

Fixes # (issue)
I have added an edge case so that if a tokenizer isn't any of the expected patterns a ValueError is raised and produced a constructive error message to hopefully to prevent a repeat of the error. I did not update the documentation and add any tests. That being said, I am more than happy to add tests if deemed appropriate/valuable

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
    - didn't update documentation but added a error message
  • Did you write any new necessary tests?
    - no but happy to do so in a separate PR since I think it doesn't fall into the scope of this current PR?

@MaartenGr
Copy link
Owner

Thank you for the PR! Is this exception raised whenever the documents are actually truncated or when you instantiate the related LLM? If it is the former, you would get an error after the clustering has been done which seems like wasted resources for the user. Would it perhaps be an idea to raise this error earlier?

@SSivakumar12
Copy link
Contributor Author

Good point! In my case right now, the error would be raised when I instantiate the relevant LLM. I will look to try and find a suitable place to raise the earlier and amend accordingly.

@SSivakumar12
Copy link
Contributor Author

SSivakumar12 commented Oct 19, 2024

Looking at an example implementation for an representation class such as llamacpp, the purpose of truncate_document, is to prepare the prompt so that it could be sent to the LLM?

I might be wrong so I do apologise but wouldn't this mean the documents are truncated at representation model rather than at clustering. This means that we wouldn't be able to surface an issues after clustering. Therefore, would this mean the location of the function is reasonable?

@MaartenGr
Copy link
Owner

Looking at an example implementation for an representation class such as llamacpp, the purpose of truncate_document, is to prepare the prompt so that it could be sent to the LLM?

That is correct.

I might be wrong so I do apologise but wouldn't this mean the documents are truncated at representation model rather than at clustering. This means that we wouldn't be able to surface an issues after clustering. Therefore, would this mean the location of the function is reasonable?

I understand your reasoning. However, that would mean that this very basic issue only gets flagged after most of the computation (and potentially hours of training) has already been done. Moreover, this is something we could already know when you initialize a given LLM. So by moving this:

            raise ValueError(
                "Please select from one of the valid options for the `tokenizer` parameter: \n"
                "{'char', 'whitespace', 'vectorizer'} \n"
                "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
            )

To the __init__ of all LLM-based representations. We could do something like this:

if tokenizer is None and doc_length is not None:
  raise ValueError(
      "Please select from one of the valid options for the `tokenizer` parameter: \n"
      "{'char', 'whitespace', 'vectorizer'} \n"
      "Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
  )

That will show the error the moment you actually create the LLM, so before any computation has been done and users can adjust accordingly.

@MaartenGr
Copy link
Owner

@SSivakumar12 Could you create the errors as a function that is imported from _utils.py? Now, we are duplicating the same code which makes it hard to manage the code.

"{'char', 'whitespace', 'vectorizer'} \n"
"If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
)
_ = validate_truncate_document_parameters(self.tokenizer, self.doc_length)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you returning the _ for a specific reason? Looking at the validate_truncate_document_parameters, there doesn't seem anything that is returned.

Comment on lines +63 to +68
if tokenizer is None and doc_length is not None:
raise ValueError(
"Please select from one of the valid options for the `tokenizer` parameter: \n"
"{'char', 'whitespace', 'vectorizer'} \n"
"If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n"
)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also include a check for the opposite? That someone does use a tokenizer but not a doc_length?

Comment on lines 73 to 74
else:
pass
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this isn't necessary right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it is not necessary but I included it to be verbose in terms of the possible edgecase. That being said, happy to remove it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove it and I'll make sure to merge it after that! Other than that, it looks great!

@MaartenGr
Copy link
Owner

@SSivakumar12 Could you check the linting? It is failing there.

@SSivakumar12
Copy link
Contributor Author

Apologies, for the oversight, have resolved the linting issues locally and the build should pass on linting now

@MaartenGr
Copy link
Owner

Awesome, everything is looking good. Thank you for your work on this!

@MaartenGr MaartenGr merged commit 50d9a49 into MaartenGr:master Dec 9, 2024
6 checks passed
@MaartenGr MaartenGr mentioned this pull request Jan 3, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants